Take Home Assignment - Origin Financial

Overview

This report has the goal of show all process developed to analyze the data challenge proposed by the Origin Financial team. It’s organized in four main sessions:

The first session will be Engineering Raw Data where I’will describe all the steps needed to clean raw data. The following step, Exploratory Data Analysis has some descriptive analysis about the data. At this point I’ll guide my analysis to data clustering based on RFM (Recency Frequency and Monetary). After that I’ll show the clustering analysis at session Clustering and some insigths based on it. Finally, I have some notes and future ideas further some conclusion about the challenge.

Engineering Raw Data

Before starting the data analysis, let’s do a quickly review at available data sets and raise needs of pre-engineering.

Customers

## Rows: 803
## Columns: 7
## $ id            <fct> 131d8363-e2a6-4c60-acef-da03e6bfbdc4, 831216da-744b-48fd…
## $ created_at    <fct> 2020-03-11 19:38:35, 2020-05-01 13:09:00, 2020-03-11 13:…
## $ date_of_birth <fct> 1990-08-16, 1933-11-22, 1986-02-02, 1990-01-01, , 1992-0…
## $ gender        <fct> male, , , male, , , , male, male, male, male, male, male…
## $ country       <fct> US, US, US, US, US, US, US, US, US, US, US, US, US, US, …
## $ state         <fct> CA, WA, NY, CA, , CA, , CA, NY, CA, CA, CA, NY, , CA, , …
## $ city          <fct> San Francisco, Tonasket, New York, San Francisco, , Cupe…

In a quick overview, it’s possibly to note that variable created_at and date_of_birth should be modified to timestamp and date, respectivily. Also, we can note that all missing data are represented by empty character space and it will be replaced by NA.

The variables state and city has some strange characters like this: <img src='#' onerror=alert('xss') /> I presume that this behavior is caused by a css operator and it’ll be replaced by NA.

The variable id it’s lowercase, while in the transactions data it’s uppercased, so it’ll be replaced by uppercase.

Transactions

## Rows: 11,059
## Columns: 17
## $ X_id                                     <fct> 4D2119A1-03D9-48F2-99B7-FEDD8…
## $ user_id                                  <fct> 94102846-0B6D-45D8-AFD1-DA80C…
## $ account_id                               <fct> 88F8D694-AFF0-4CFC-BBB6-C89D1…
## $ account_name                             <fct> Chase - Plaid Money Market (4…
## $ description                              <fct> ACH Electronic CreditGUSTO PA…
## $ type                                     <fct> expense, expense, expense, ex…
## $ amount                                   <dbl> -5850, -5850, -5850, -5850, -…
## $ date                                     <fct> 2021-06-08T00:00:00.000Z, 202…
## $ extra_fields.category.0                  <fct> Transfer, Transfer, Transfer,…
## $ extra_fields.category.1                  <fct> Debit, Debit, Debit, Debit, D…
## $ extra_fields.category.2                  <fct> , , , , , , , , , , , , , , ,…
## $ extra_fields.category_id                 <int> 21006000, 21006000, 21006000,…
## $ extra_fields.merchant_name               <fct> , , , , , , , , , , , , , , ,…
## $ extra_fields.name                        <fct> ACH Electronic CreditGUSTO PA…
## $ extra_fields.payment_channel             <fct> other, other, other, other, o…
## $ extra_fields.payment_meta.payment_method <fct> ACH, ACH, ACH, ACH, ACH, ACH,…
## $ created_at                               <fct> 2021-06-18T19:14:32.075Z, 202…

The transaction data doesn’t so much engineering to do. It’s just change timestamp and date types and replace empty spaces by NA. I also, replaced . by _ in some variable names.

Exploratory Data Analysis

The exploratory data analysis it’s divides

## Warning: Couldn't find skimmers for class: POSIXlt, POSIXt; No user-defined
## `sfl` provided. Falling back to `character`.
Data summary
Name db_users
Number of rows 803
Number of columns 8
_______________________
Column type frequency:
character 6
Date 1
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1.00 36 36 0 803 0
created_at 0 1.00 2414 5633 0 803 0
gender 751 0.06 4 10 0 4 0
country 0 1.00 2 2 0 1 0
state 179 0.78 2 2 0 27 0
city 176 0.78 4 21 0 120 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date_of_birth 173 0.78 1000-01-01 2001-01-01 1989-12-12 119

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age_at 173 0.78 33.58 40.24 19.43 29.65 31.1 34.01 1021.03 ▇▁▁▁▁

At this simple description about the dataset it’s possibly to note that age has a outlier that will be replaced by NA. Also, the great number of missing data in the variable gender, turning it almost useless in the data analysis.

Univariate Analysis

Age

The customers age distribution are concentraded between the range 30 - 50 years as shown bellow:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Maybe this variable could bea good one to use in the clustering process.

State

The geographical distribuiton of customer are really dense in the California State (31%) followed by MD(11.1%) add NC/NY tied with (7.8%). It’s important to note that more than 22% of the customers has no state information.

## Warning: sf layer has inconsistent datum (+proj=longlat +datum=NAD83 +no_defs).
## Need '+proj=longlat +datum=WGS84'

Users Transactions

Looking to the user transactions, I note a tiny number of different customers, just 84. Some of them (almost 20%) has more than one account, but as we expecte to analysi customer behavior over account behaviour I’ll treat them as the same customer.

## Storing counts in `nn`, as `n` already present in input
## ℹ Use `name = "new_name"` to pick a new name.
Number of Accounts Number of Customers % of Customers
1 70 83.33
2 7 8.33
3 1 1.19
4 2 2.38
5 2 2.38
6 1 1.19
7 1 1.19

At this point it’s really important to say that just 84 customers to do the cluster analysis could be a problem for a bunch of reasos as like:

  • Some algorithms won’t provide robust results under small samples

  • The cluster interpretation could be a little messy.

  • The pattern recognition it’s hard to do

Sad that i’ll develop my analysis and won’t

Type of transacions

Analysing the type of transactions we can note that has a mix of too different transactions. Income, Expense and Transfer, as I’ll show in the next steps, it will be treated separatelly to do the clustering analysis.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The amount money transactions types has a strange behaviour having low variety of values. Transfer for example has just the value 25. It’s something that could affect the behavior of clusterig analysis.

## `summarise()` has grouped output by 'date'. You can override using the `.groups` argument.

Looking to number and amount of transactions per day by expenses, transfer and income it’s possibly to note that the days 2021-06-15 (income and expense) and 2021-06-16 (income) has a lot more transactions compare to other days. Expense category it’s almost 10x the median for the period and for income it’s mre than 10x.

I don’t know if this strange behavior comes from a selection bias or if this is from the data. I’ll presume that this behavior is a normal one.

Transactions Category

Looking the transactions category we can note that we have a lot of missing ones, but part of that is caused by income transactions.

Almost 20% of transactions came from travel category, divided into taxi (Uber) and Airlines and Aviation.

Food and drink represent 25% divided in restaurants like coffe shop and fast foods.

Also has some transactions repesented by recreation and shops ans some transfer between accounts

Merchant Name

Looking deeper to the expense types and merchant names, we can see at the top Uber, McDonalds, starbaucks and so on.

Channel Transactions

The preferencial channel to expenses is in store(73%). Would be interesting if we had some information abou online channels as app payment, website shop e etc.

Payment Methods

By payment methods we have just 4% of transactions with this data, it’s almost useless to the analysis.

RFM Transformation

After this exploratory analysis about variables and looking to a solution to clustering problem . I decided transform the transactional data into recency frequency and monetary value for the tree types of transactions: Expense, Income and Transfer.

After this transformation our dataset looks like this:

## Rows: 84
## Columns: 33
## $ id                                      <chr> "AD7226ED-2D26-45FD-AB37-C3823…
## $ state                                   <chr> "NY", "CA", "IL", "NY", NA, "V…
## $ age_at                                  <dbl> 34.12794, 32.54987, 33.36635, …
## $ count_transactions_expense              <dbl> 121, 120, 144, 538, 270, 0, 0,…
## $ amount_expense                          <dbl> 2825.37, 15654.90, 18785.88, 2…
## $ amount_mean_expense                     <dbl> 23.35017, 130.45750, 130.45750…
## $ count_transactions_income               <dbl> 24, 210, 252, 50, 24, 6, 6, 6,…
## $ amount_income                           <dbl> 12000.00, 585798.60, 702958.32…
## $ amount_mean_income                      <dbl> 5.000000e+02, 2.789517e+03, 2.…
## $ count_transactions_transfer             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ amount_transfer                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ amount_mean_transfer                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ n_transactions_transfer                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ n_transactions_food_and_drink           <dbl> 72, 0, 0, 246, 122, 0, 0, 0, 0…
## $ n_transactions_payment                  <dbl> 0, 0, 0, 48, 25, 0, 0, 0, 0, 0…
## $ n_transactions_recreation               <dbl> 0, 0, 0, 50, 24, 0, 0, 0, 0, 0…
## $ n_transactions_shops                    <dbl> 0, 0, 0, 48, 25, 0, 0, 0, 0, 0…
## $ n_transactions_travel                   <dbl> 73, 0, 0, 196, 98, 0, 0, 0, 0,…
## $ transactions_amount_transfer            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ transactions_amount_food_and_drink      <dbl> 2537.52, 0.00, 0.00, 53286.50,…
## $ transactions_amount_payment             <dbl> 0.0, 0.0, 0.0, 99768.0, 51962.…
## $ transactions_amount_recreation          <dbl> 0, 0, 0, 3925, 1884, 0, 0, 0, …
## $ transactions_amount_shops               <dbl> 0, 0, 0, 24000, 12500, 0, 0, 0…
## $ transactions_amount_travel              <dbl> 12287.85, 0.00, 0.00, 49573.84…
## $ transactions_amount_mean_transfer       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ transactions_amount_mean_food_and_drink <dbl> 35.24333, 0.00000, 0.00000, 21…
## $ transactions_amount_mean_payment        <dbl> 0.0, 0.0, 0.0, 2078.5, 2078.5,…
## $ transactions_amount_mean_recreation     <dbl> 0.0, 0.0, 0.0, 78.5, 78.5, 0.0…
## $ transactions_amount_mean_shops          <dbl> 0, 0, 0, 500, 500, 0, 0, 0, 0,…
## $ transactions_amount_mean_travel         <dbl> 168.3267, 0.0000, 0.0000, 252.…
## $ no_channel                              <dbl> 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, …
## $ other                                   <dbl> 0.1655172, 0.0000000, 0.000000…
## $ in_store                                <dbl> 0.8344828, 0.0000000, 0.000000…

Just 84 datapoints (customers) and 33 variables. I drop the recency variables because I tougth that itsn’t the goal of the analysis explore “how much time the customer doens’t have any transaction”.

Expense vs Income

Firstly, analysing the behavior of the amount mean of transactions in expenses and income we can note some behaviours.

Note that some customer has high income mean and low expense mean, while we have others that expend more than earn. This is a good expoiler about the clustering.

It’s important to note the outlier that expend much more than earn, I will keep this customer in analysis because it’s a different behaviour that could be more representative if we have more data.

Expense Vs Income Vs Age

Adding the variable age to our analysis, it’s possibly to note that it doesn’t affect the behavior of income and expense, so this variable wouldn’t be useful in the cluster analysis.

## Warning: Removed 6 rows containing missing values (geom_point).

Expense vs Income vs Transfer

Transfer mean amount it’s another variable that doesn’t affect the income and expense. As shown bellow:

Relationship between categorys of expenses

Looking to the correlation between category expenses it’s possibly to note that every transaction has a high correlation. This analysis isn’t so conclusive because we have just 13 customers with expenses categories.

Also, I will try to do a cluster with this datapoints, the main ideia is to show my way of thinking the problem than try to interpret something. Because just 13 customer isn’t enough data to any analysis.

## `summarise()` has grouped output by 'user_id', 'date'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'user_id'. You can override using the `.groups` argument.

Clustering

At the clustering analysis my goal is to made two type of clusters, one based on incomes and expenses and another one based on the categories of expense. I think this way could give us some conclusions about the data.

It’s important to say that I use just one method of clustering called kmeans, it’s based on centroid distances. This choice was made because the datasets has few datapoints and I would like to maintain the analysis as simple as possibly.

At the beggining of very clustering I will do a analysis to determine the optimal number of cluster, after that I will run the cluster analysis and made some insigths about the clusters responses.

Expense vs Income

Expense vs Income vs Freq Expense

Amount Type Expense

## # A tibble: 3 x 13
##   cluster transactions_amo… transactions_amo… transactions_amo… transactions_am…
##   <fct>               <dbl>             <dbl>             <dbl>            <dbl>
## 1 1                  0.616             0.0972             0.186          0.00699
## 2 2                  0.660             0.0922             0.162          0.00636
## 3 3                  0.0212            0.333              0.198          0.00617
## # … with 8 more variables: transactions_amount_shops_mean <dbl>,
## #   transactions_amount_travel_mean <dbl>,
## #   transactions_amount_mean_transfer_mean <dbl>,
## #   transactions_amount_mean_food_and_drink_mean <dbl>,
## #   transactions_amount_mean_payment_mean <dbl>,
## #   transactions_amount_mean_recreation_mean <dbl>,
## #   transactions_amount_mean_shops_mean <dbl>,
## #   transactions_amount_mean_travel_mean <dbl>

Frequency Type Expense

## # A tibble: 3 x 7
##   cluster n_transactions_t… n_transactions_f… n_transactions_p… n_transactions_…
##   <fct>               <dbl>             <dbl>             <dbl>            <dbl>
## 1 1                  0.0877             0.479            0.107            0.0587
## 2 2                  0.193              0.191            0.0899           0.0201
## 3 3                  0.222              0.368            0.148            0.0741
## # … with 2 more variables: n_transactions_shops_mean <dbl>,
## #   n_transactions_travel_mean <dbl>

Conclusion

Notes

This take home assignment was very challenging because it’s a open problem and I could have a lot of different ways to do the analysis, My choice was explore the RFM customer behavior and at the end just FM.

I choose to explore just one method of clustering to made things simple. But in a work day problem I would test some other method like hierarchical clustering and principally density based clustering like Dbscan and Hdbscan, because this methods has the advantage of not classifing some data points with the ideia of “Not all customers belongs a cluster”.

I would like to thank the Origin team for dedicate time to elaborate this problem and give me the chance show my work.